Sears Roebuck Kit Home Report

M. Fawcett December 23, 2021

Short History of Sears Kit Homes

Between 1908 and 1942, the Sears Roebuck company sold houses in the form of build-it-yourself kits. Huge kits were prepared at factories, mostly in Illinois, and shipped in a railroad freight car to customers all over the country. Each kit contained all the materials needed to build a house. Customers lugged their 25 tons of numbered, precut lumber, shingles, wall board, flooring and so on from the freight car to their building site, and got to work following the instructions in the 75-page construction guide. Happy buyers considered the homes attractive, well designed and economical: about 30 percent less costly than a comparable existing home.

This unlikely-seeming business concept was successful, producing sales of between 70,000 and 100,000 kits. Because Sears was a mail-order marketer of all manner of merchandise, it also sold the tools used to build the houses, and then the appliances, furniture and fixtures that filled them once they were completed. More than one observer has called Sears the Amazon of its time.

But success was cut short when the Great Depression and World War II took away much of the demand for new housing. After the war, a new trend, tract housing, took over and the Sears kit home business faded into oblivion. The Sears company itself has lately been following the same path. Having declared bankruptcy in 2018, Sears now barely exists except in legal proceedings while its few remaining stores are gradually liquidated.

With no official list of where kit homes were built, a few fascinated enthusiasts now hunt for them and share their discoveries through social media and websites. They find the story of these homes so charming that they don't want to see them lost to history.

This report describes where kit homes have been found and offers clues as to where others are likely to be found. It looks at factors like street name, distance from railroads, local economic conditions and population characteristics. There is no attempt at statistical robustness in this analysis.

Footnote... My original goal was to create a computer program that could analyze a picture of a house and tell you whether it was a Sears kit home and, if so, its model name. This turned out to be too hard a problem due to the large number of models (around 370) produced over the years. Another goal was to have a computer program "crawl" through Google Street View images of houses and pick out ones with a high probability of being a Sears kit home. This turned out to be economically infeasible because Google charged 7/10ths of a cent every time my program used Street View to capture an image. Scanning 10,000 images of houses cost me \$70.00. Scanning 1 million images would have cost \$7,000.00.

This computer program was written in the Python language. Another software package called QGIS was used to prepare some of the data displayed in the maps. The US Census Bureau provided neighborhood social and economic data.

In [1]:
# Load Python modules needed for the analysis
import pandas as pd   # for dataframe manipulation
import numpy as np  # for numerical analysis
import matplotlib.pyplot as plt   # for generating plots and graphs
from matplotlib.pyplot import figure  # for modifying appearance of plots & graphs
import requests   # to make http requests to the US Census geocoder and census web API
import io  # for working with I/O streams; allows conversion of geocode responses to dataframes
import csv  # for reading and writing csv files
import pickle as pk # to store and retrieve dataframes on disk
import os   # to list contents of disk drive folders
import sys  # for managing system options
import folium  # the mapping package
from folium import plugins  # to allow cluster markers on maps
import seaborn as sns   # for fancy plotting
from IPython.display import Markdown as md  # for embedding variables in markdown cells
In [2]:
# See http://blog.nextgenetics.net/?e=102 for hiding computer code in this report.
In [3]:
from IPython.display import HTML
In [4]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[4]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.
In [5]:
# Settings to improve the display of tabular results
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
np.set_printoptions(threshold = sys.maxsize)

Contents of the Report

  • Preparing the data
  • Count of kit homes by state
  • Count of kit homes by model name
  • Map of Kit Home Locations
  • Most frequent street names
  • Distance to railroad
  • Age of structures of neighborhood
  • Income of neighborhood
  • Education level of neighborhood
  • Owned versus rented of neighborhood
  • (Value of homes of neighborhood)

Preparing the data

A spreadsheet (maintained by Lara) of around 13,000 kit home locations is the basis of this analysis. The list of locations is as of October 27, 2021.

The first step was to load the list of kit home locations from the spreadsheet into memory and do some minor tidying of the data. Overall the data seems pretty clean. Generally the values in each column were entered in a consistent way, so I didn't have to do much to make good use of the data.

The main cleanup was in a column called "Auth?". I changed the name to "Auth" and all the values in the column were converted to uppercase so they are either "YES" or "NO". The Auth column indicates whether a location has gone through a confirmation step showing there is evidence that it is truly a Sears kit home.

Below is a sample of the data from the spreadsheet after the cleanup.

None of these changes affect the original spreadsheet. All of this work is done on an in-memory copy of the spreadsheet data.

In [6]:
# Read the Excel file of kit home locations into a Pandas dataframe.
address_df = pd.read_excel(r"Sears Roebuck Houses.xlsx", sheet_name = "Locations")

# Add a row number to each address.  A unique number for each row will be needed by the 
# US Census Bureau geocoder
address_df.insert(loc=0, column='row_num', value=np.arange(len(address_df)) + 2) 
# the +2 is to add 2 to each row number to account for the header row and row "0".
# I want the row_num value to be aligned wth the row number in the original Excel file.

# Remove the "?" from the Auth? column name.
address_df.rename(columns={"Auth?": "Auth"}, inplace = True)

# Tidy up the values in the "Auth" column
# Change the "nan" to "N/A".
address_df["Auth"] = address_df["Auth"].replace(np.nan, 'N/A', regex=True)
# Make all the values in the Auth column uppercase
address_df["Auth"] = address_df["Auth"].apply(lambda x: x.upper())

# Examine some of the cleaned up data
address_df.head()
Out[6]:
row_num Model Address City State Year Auth Added Notes Link #1 Link #2 Twp/Borough/Neighborhood County and State
0 2 Windsor 105 Meadow Brook Dr Clarks Summit PA NaN NO AIM NaN NaN NaN NaN NaN
1 3 Columbine 11146 66 St NW Edmonton AB 1928.0 YES LS Building permit. Canada NaN NaN NaN NaN
2 4 Belmont (old one) 541 Pine St Ketchikan AK NaN NO LS NaN NaN NaN NaN NaN
3 5 Americus 303 E Samford Ave Auburn AL NaN NO NaN NaN http://photos.al.com/alphotos/2014/09/alabamas... NaN NaN NaN
4 6 Elsmore 608 Brummel Ave Bridgeport AL NaN NO NaN NaN NaN NaN NaN NaN
In [17]:
# Total number of locations
tot = len(pd.unique(address_df["Address"]))
# Number authenticated
yes_n = len(address_df[address_df["Auth"] == 'YES'])
md("The total number of locations contained in the spreadsheet is {:,}. The number that have been authenticated is {:,}.".format(tot, yes_n))
Out[17]:

The total number of locations contained in the spreadsheet is 13,832. The number that have been authenticated is 5,688.

Count of Kit Homes by State

As the chart below shows, Ohio has the most kit homes followed by Illinois, Pennsylvania and New York. Every state appears to have at least one kit home.

These counts include authenticated plus unauthenticated locations.

In [16]:
state_count = address_df['State'].value_counts() 

# Plot a barchart 
figure(figsize=(16, 6))
state_count.plot.bar()
plt.title("Number of Locations by State")
plt.show()

Count of Kit Homes by Model Name

In [18]:
num_models = len(address_df['Model'].value_counts())

md("There are {} models mentioned in the spreadsheet. Some of these are variations "  \
      "of the same model name, for example Concord, Concord/No. 114, Concord/No. 3379.".format(num_models))
Out[18]:

There are 432 models mentioned in the spreadsheet. Some of these are variations of the same model name, for example Concord, Concord/No. 114, Concord/No. 3379.

In [19]:
# List all the model names in alphabetical order.
models_df = pd.DataFrame(pd.unique(address_df["Model"])) # .astype(str).sort())
# models_df[0].sort_values()

Below is a list of the top 50 most frequently mentioned model names in the spreadsheet and how many times each was mentioned.

In [20]:
model_count = address_df['Model'].value_counts().nlargest(50) 

# Plot a barchart 
figure(figsize=(10, 16))
model_count.plot.barh()
plt.title("Number of Kit Homes by Model Name (Top 50)")
plt.show()

Map of Kit Home Locations

Mapping the kit home locations requires having their longitude and latitude. The US Census Bureau provides a service called "Geocoding" for translating a mailing address into a pair of long/lat coordinates. It took around 20 minutes to process the list of 13,000 addresses.

Below is a sample of the spreadsheet data enhanced with the additional information provided by the geocoding. Scrolling the sample horizontally reveals the additional columns for longitude, latitude, state code, county code, census tract number and ZIP code. More about "census tracts" and "GEOIDs" later.
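The geocoding itself ran in an earlier session and its results were stored on disk, so the request code does not appear in this report. Below is a minimal sketch of how a batch request to the Census geocoder might look. The endpoint and the `benchmark`/`vintage` parameters are from the Census Bureau's public batch geocoding service; the helper functions and sample rows are invented for illustration.

```python
import csv
import io
import requests  # for the POST to the Census batch geocoder

GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/geographies/addressbatch"

def build_batch_csv(rows):
    """Build the CSV payload the batch geocoder expects:
    one 'id,street,city,state,zip' line per address, no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row_num, street, city, state, zipcode in rows:
        writer.writerow([row_num, street, city, state, zipcode])
    return buf.getvalue()

def geocode_batch(rows):
    """POST a batch of addresses (the service caps batches at 10,000)
    and return the raw CSV response text."""
    payload = build_batch_csv(rows)
    resp = requests.post(
        GEOCODER_URL,
        files={"addressFile": ("addresses.csv", payload)},
        data={"benchmark": "Public_AR_Current", "vintage": "Current_Current"},
    )
    resp.raise_for_status()
    return resp.text

# Example payload for two addresses from the spreadsheet (ZIP left blank):
sample = [
    (2, "105 Meadow Brook Dr", "Clarks Summit", "PA", ""),
    (5, "303 E Samford Ave", "Auburn", "AL", ""),
]
print(build_batch_csv(sample))
```

The 13,000 addresses would be sent in a few such batches, and the response rows matched back to the spreadsheet by the `row_num` identifier.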

In [21]:
# Retrieve the coordinates and other results of the geocoding that were previously stored in a computer file.
geocoded_results_df = pd.read_pickle('geocoded_results.pkl')

# Only keep rows that were successfully geocoded
geocoded_results_df = geocoded_results_df[geocoded_results_df["MATCH_INDICATOR"] == "Match"]

# Convert geography code values from numeric to string
geocoded_results_df['FIPS_STATE'] = geocoded_results_df['FIPS_STATE'].astype(int).astype(str)
geocoded_results_df['FIPS_COUNTY'] = geocoded_results_df['FIPS_COUNTY'].astype(int).astype(str)
geocoded_results_df['CENSUS_TRACT'] = geocoded_results_df['CENSUS_TRACT'].astype(int).astype(str)

# Left-pad geography values with zeros
geocoded_results_df['FIPS_STATE'] = geocoded_results_df['FIPS_STATE'].apply('{:0>2}'.format)
geocoded_results_df['FIPS_COUNTY'] = geocoded_results_df['FIPS_COUNTY'].apply('{:0>3}'.format)
geocoded_results_df['CENSUS_TRACT'] = geocoded_results_df['CENSUS_TRACT'].apply('{:0>6}'.format)


# Create a unique geographic identifier by combining the state, county and census tract codes for each row.
geocoded_results_df["GeoID"] = geocoded_results_df["FIPS_STATE"] \
                                + geocoded_results_df["FIPS_COUNTY"] \
                                + geocoded_results_df["CENSUS_TRACT"]

# Split the LONG_LAT column into separate Longitude and Latitude columns
geocoded_results_df[['Longitude', 'Latitude']] = geocoded_results_df['LONG_LAT'].str.rsplit(',', n=1, expand=True)

# Merge the Sears Kit Home style for each location from the original address list with the geocoded results.
mapping_data_df = pd.merge(left = address_df[['row_num','Model','Address','City','State','Auth']], 
                           right = geocoded_results_df, 
                           how = 'right', 
                           left_on = 'row_num',
                           right_on = 'ID')

# Examine some of the results
mapping_data_df.head()
Out[21]:
row_num Model Address City State Auth ID ADDRESS_IN MATCH_INDICATOR MATCH_TYPE ADDRESS_OUT LONG_LAT TIGER_EDGE STREET_SIDE FIPS_STATE FIPS_COUNTY CENSUS_TRACT CENSUS_BLOCK Zipcode GeoID Longitude Latitude
0 2 Windsor 105 Meadow Brook Dr Clarks Summit PA NO 2 105 Meadow Brook Dr, Clarks Summit, PA, Match Exact 105 MEADOW BROOK DR, CLARKS SUMMIT, PA, 18411 -75.71287,41.500095 139319156.0 R 42 069 110402 2006.0 18411 42069110402 -75.71287 41.500095
1 4 Belmont (old one) 541 Pine St Ketchikan AK NO 4 541 Pine St, Ketchikan, AK, Match Exact 541 PINE ST, KETCHIKAN, AK, 99901 -131.64699,55.344284 207096132.0 L 02 130 000300 2000.0 99901 02130000300 -131.64699 55.344284
2 5 Americus 303 E Samford Ave Auburn AL NO 5 303 E Samford Ave, Auburn, AL, Match Exact 303 E SAMFORD AVE, AUBURN, AL, 36830 -85.47823,32.59884 1569988.0 L 01 081 040300 2008.0 36830 01081040300 -85.47823 32.59884
3 6 Elsmore 608 Brummel Ave Bridgeport AL NO 6 608 Brummel Ave, Bridgeport, AL, Match Exact 608 BRUMMEL AVE, BRIDGEPORT, AL, 35740 -85.7156,34.947346 58044064.0 R 01 071 950200 1062.0 35740 01071950200 -85.7156 34.947346
4 7 Osborn 708 2nd St SE Cullman AL NO 7 708 2nd St SE, Cullman, AL, Match Exact 708 2ND ST SE, CULLMAN, AL, 35055 -86.83624,34.17931 130220321.0 R 01 043 964901 4032.0 35055 01043964901 -86.83624 34.17931
In [22]:
# Build a list containing all the coordinates so they can be plotted on the map
locations = mapping_data_df[['Latitude', 'Longitude']]
locationlist = locations.values.tolist()

How to use the interactive map

The map shows the CONFIRMED (authenticated) kit home locations with DARK BLUE markers and the UNCONFIRMED (not authenticated) kit home locations with LIGHT BLUE markers.

The icon that looks like a stack of square pancakes in the upper right corner of the map is the "layer control". You can use it to hide or show the confirmed and unconfirmed markers.

The + and - icons in the upper left of the map let you zoom in and out.

The number in each blue rectangle is the count of houses in the group it represents. Clicking a numbered marker zooms in and separates the big group into smaller groups.

Once you zoom in far enough you will see individual markers that tag a single location. These are the markers with a little "i" in the center. If you click on one of these you will see the address, the model name and whether it has been confirmed.

Something that is kind of fun to do is to highlight the address (just the address) in the pop-up tag, right-click the highlight, and select "Search with Google" (it might say something different depending on your browser). In most cases it will bring up a Google Street View page for the house. There is no charge for this sort of use of Street View.

In [23]:
### Define functions to set the color of cluster markers. Confirmed and unconfirmed locations have 
### different colors. This gets used by all maps.
# This sets the color for CONFIRMED locations clusters.
icon_create_function_confirmed = """
    function(cluster) {
    var childCount = cluster.getChildCount(); 
    /* 
    // comment: can have something like the following to modify the different cluster sizes....
    var c = ' marker-cluster-';

    if (childCount < 50) {
        c += 'large';
    } else if (childCount < 300) {
        c += 'medium';
    } else {
        c += 'small';
    }    
    
    // The marker-cluster-<'size'> gets passed in the "return new L.DivIcon()" function below.
    */
    
    return new L.DivIcon({ html: '<div><span style="background-color:darkblue;color:white;font-size: 20px;">' + childCount + '</span></div>', className: 'marker-cluster', iconSize: new L.Point(40, 30) });    
    }
    """

# This sets the color of UNCONFIRMED location clusters.
icon_create_function_unconfirmed = """
    function(cluster) {
    var childCount = cluster.getChildCount(); 

    return new L.DivIcon({ html: '<div><span style="background-color:lightblue;color:black;font-size: 20px;">' + childCount + '</span></div>', className: 'marker-cluster', iconSize: new L.Point(40, 30) });    
    }
    """
In [24]:
# Create a map using the Map() function and the coordinates of the locations of all the homes.

# Map starts out centered on Ohio.
mp = folium.Map(location=[40.367474, -82.996216], zoom_start=7, width=900, height=550, control_scale=True)
# Ohio_map


# Feature groups allow customization of the layer control labels so they don't show auto-generated names.
fg_confirmed = folium.FeatureGroup(name = 'Confirmed Locations', show = True)
mp.add_child(fg_confirmed)
fg_unconfirmed = folium.FeatureGroup(name = 'Unconfirmed Locations', show = True)
mp.add_child(fg_unconfirmed)

# Add the Marker clusters for confirmed and unconfirmed locations to feature group
marker_cluster_confirmed = plugins.MarkerCluster(icon_create_function = icon_create_function_confirmed).add_to(fg_confirmed)  
marker_cluster_unconfirmed = plugins.MarkerCluster(icon_create_function=icon_create_function_unconfirmed).add_to(fg_unconfirmed) 

# A function to choose a marker color depending on if the house is a confirmed kit house or not.
# The individual location markers use the same color as their cluster markers.
def getcolor(auth_val):
    if auth_val == 'YES':
        return ("darkblue", "Confirmed")
    return ("lightblue","Unconfirmed")

### Add a layer to the map showing Confirmed kit homes
# Loop through all the location pairs.
for point in range(0, len(locationlist)):
    try:
        clr, status = getcolor(mapping_data_df["Auth"][point])
        if status == "Confirmed":
            folium.Marker(
                location = locationlist[point], 
                popup = status + " " + mapping_data_df['Model'][point] + ": " + mapping_data_df['ADDRESS_OUT'][point],            
                icon = folium.Icon(color = clr)
            ).add_to(marker_cluster_confirmed)

    except Exception:  # not all addresses could be geocoded so skip them if coordinates are missing
        pass
        
### Add a layer to the map showing Unconfirmed kit homes
for point in range(0, len(locationlist)):
    try:
        clr, status = getcolor(mapping_data_df["Auth"][point])
        if status == "Unconfirmed":
            folium.Marker(
                location = locationlist[point], 
                popup = status + " " + mapping_data_df['Model'][point] + ": " + mapping_data_df['ADDRESS_OUT'][point],            
                icon = folium.Icon(color = clr)
            ).add_to(marker_cluster_unconfirmed)

    except Exception:  # not all addresses could be geocoded so skip them if coordinates are missing
        pass
        
# add layer control to map (allows layer to be turned on or off)
folium.LayerControl().add_to(mp)

# Display the map
mp
Out[24]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Most Frequent Street Names

There are anecdotal reports that certain street names are more likely to have kit homes. This might happen if trends in street naming occur over time and certain street names were more popular before 1942, when the last kit home was sold. To see whether the data supports this, I looked at the distribution of kit homes across Ohio street names, and it does appear to be true.

I started by counting the number of residential addresses for each street name in Ohio. Dividing the count for each street name by the total number of residential addresses in Ohio gives a proportion for each street. I did a similar calculation using the number of kit homes on each street name and the total number of kit homes in Ohio. Then I compared the two proportions for each street name. If a street's proportion in the kit home calculation was greater than the proportion for residential addresses overall, it might indicate that a street name is more likely to have a kit home on it.

For the proportion calculations I used both the authenticated and unauthenticated locations. Street names were converted to the base version of the actual street name, without adornments like "DR", "AVE", "N", "S" etc. I did not convert numbered streets like "1st" to "First", which might have improved the results a bit.
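The expected-versus-actual comparison described above can be sketched with toy numbers. The street names, counts and totals below are invented for illustration only; they are not taken from the Ohio data.

```python
import pandas as pd

# Hypothetical totals for illustration only
state_residential_total = 1_000_000   # all residential addresses in the state
kit_home_total = 2_500                # all kit homes found in the state

streets = pd.DataFrame({
    "StreetName": ["PARK", "MAPLE", "ELM"],
    "AddressCount": [3000, 2500, 500],   # residential addresses per street name
    "ActualKitHomes": [12, 5, 4],        # kit homes observed per street name
}).set_index("StreetName")

# Proportion of all residential addresses carried by each street name
streets["Proportion"] = streets["AddressCount"] / state_residential_total

# Expected kit homes if they were spread like residential addresses overall
streets["ExpectedKitHomes"] = round(streets["Proportion"] * kit_home_total, 2)

# Streets with more kit homes than the proportions alone would predict
print(streets[streets["ActualKitHomes"] > streets["ExpectedKitHomes"]])
```

In this toy example, PARK (12 actual vs 7.5 expected) and ELM (4 vs 1.25) exceed expectations while MAPLE (5 vs 6.25) does not; the real calculation below works the same way on the NAD and spreadsheet counts.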

To get the number of residential addresses for each street I used a resource called the National Address Database maintained by the US Department of Transportation. See https://www.transportation.gov/gis/national-address-database/national-address-database-0.

In [25]:
# Build an Ohio list of kit house locations
kitpoints_df = mapping_data_df.loc[(mapping_data_df['MATCH_INDICATOR'] != 'No_Match') & (mapping_data_df['State'] == 'OH'), ['Model', 'Latitude', 'Longitude', 'FIPS_STATE', 'FIPS_COUNTY', 'CENSUS_TRACT', 'GeoID']] # ['Model', 'Latitude', 'Longitude']]
In [26]:
# Save the Ohio points to a csv file so the data can be used by QGIS software later.
kitpoints_df.to_csv("OhioKitPoints.csv", sep = ",", header = True, index = False)
In [27]:
# The next step assumes that I previously downloaded the street addresses for Ohio from the National Address Database 
# and stored them in a CSV file.
# Extract the street name and address type of each address in the state from the csv file.
column_lst = ["StreetName", "Addr_Type"]
nad_addressTypes_df = pd.read_csv("NAD_r7_Ohio.csv", usecols = column_lst)
# Count the occurrences of each address type
addrtype_df = pd.DataFrame(nad_addressTypes_df.groupby(['Addr_Type']).size())

# Assign name to column containing counts
addrtype_df.rename(columns={0: "Count"}, inplace = True)

These are the counts for all the address types in Ohio according to the NAD.

In [28]:
addrtype_df
Out[28]:
Count
Addr_Type
Commercial 286354
Educational 1898
Government 1413
Industrial 735
Other 22863
Residential 3501707
Unknown 854263
In [29]:
state_address_count = addrtype_df.loc["Residential", "Count"] 

md("There are {:,} addresses that are specifically identified as Residential in Ohio according to the \
NAD. I will use {:,} as the denominator when calculating the proportion of residential addresses on each \
street name.  There are {:,} Ohio kit homes in the spreadsheet.".format(state_address_count,state_address_count, num_kithomes_in_state))
In [30]:
# The NAD "StreetName" column provides a basic (without Drive, Street, Avenue, etc) street name for 
# each address, which is what I'll need in my street address frequency analysis.
nad_street_address_count_df = pd.DataFrame(nad_addressTypes_df.loc[nad_addressTypes_df['Addr_Type'] == 'Residential'].groupby(['StreetName']).size().sort_values(ascending=False))
In [31]:
# Getting a similar basic street name for each address in the kit home list is a little more complicated.

# Start with the full addresses
kithome_address_df = mapping_data_df.loc[mapping_data_df['State'] == 'OH', ("State", 'Address')]

# Remove anything that is not a word character or a space
myregex0 = r"[^\w\s]"
kithome_streetnames_ser0 = kithome_address_df["Address"].str.replace(myregex0, '', regex = True, case = False)

# Remove the leading house number
myregex1 = r"(^\d+)"
kit_home_streetnames_ser1 = kithome_streetnames_ser0.str.replace(myregex1, '', regex = True, case = False)

# Remove directionals (N, E, S, W, NE, SE) and street types (St, Ave, Rd, Dr, Pl, Blvd)
myregex2 = r"( [nesw]\b)|( dr\b)|( st\b)|( rd\b)|( ave\b)|( pl\b)|( blvd\b)|( se\b)|( ne\b)"
kithome_streetnames_ser2 = kit_home_streetnames_ser1.str.replace(myregex2, '', regex = True, case = False)

num_kithomes_in_state = len(kithome_streetnames_ser2)

To keep things manageable I'll focus on the 30 street names with the most kit homes.

In [32]:
# Count the kit homes per street and take the top 30 streets.
kithome_top_30_streets_df = pd.DataFrame(kithome_streetnames_ser2.value_counts()).head(30)

# Give the counts column a name
kithome_top_30_streets_df.columns = ["Actual Kit Homes"]

# Give the index a name
kithome_top_30_streets_df.index.name = "StreetName"

# Make index upper case
kithome_top_30_streets_df.index = kithome_top_30_streets_df.index.str.upper()

# Strip whitespace off index values (the street names)
kithome_top_30_streets_df.index = kithome_top_30_streets_df.index.str.strip()
In [33]:
# Get the address count proportions for street names in state overall
ohio_address_count_by_street_df = nad_street_address_count_df

# Add a column heading
ohio_address_count_by_street_df.columns = ["AddressCount"]

# Add a column that shows the proportion of addresses for each street name in the state
ohio_address_count_by_street_df['Proportion'] = ohio_address_count_by_street_df['AddressCount'] / state_address_count
In [34]:
md("The table below lists street names where the actual number of kit homes exceeded the \
expected number. For example, in the first row, streets named Park have a total of \
9,700 residential addresses, or 0.002770 of all {:,} residential addresses in Ohio.".format(state_address_count))
Out[34]:

The table below lists street names where the actual number of kit homes exceeded the expected number. For example, in the first row, streets named Park have a total of 9,700 residential addresses, or 0.002770 of all 3,501,707 residential addresses in Ohio.

In [35]:
md("You might expect a similar proportion of the {:,} Ohio kit homes to be located on a street named Park. \
That works out to be 7.60. The data shows there are actually 9 homes.  Whether any of these results are \
statistically significant is not something I can say. Also, these results are just for one state but \
it shows it could be worthwhile to expand the street analysis to the rest of the country.".format(num_kithomes_in_state))
Out[35]:

You might expect a similar proportion of the 2,745 Ohio kit homes to be located on a street named Park. That works out to be 7.60. The data shows there are actually 9 homes. Whether any of these results are statistically significant is not something I can say. Also, these results are just for one state but it shows it could be worthwhile to expand the street analysis to the rest of the country.

In [36]:
# Calculate the number of kit homes expected for each Ohio street
ohio_address_count_by_street_df["Expected Kit Homes"] = round(ohio_address_count_by_street_df["Proportion"] * num_kithomes_in_state, 2)

# Merge the actual counts of kit homes by street name and compare to the expected number.
comparison3_df = ohio_address_count_by_street_df.join(kithome_top_30_streets_df, lsuffix='_State', rsuffix='_Kits')


comparison3_df[comparison3_df["Actual Kit Homes"] - comparison3_df["Expected Kit Homes"] > 0]
Out[36]:
AddressCount Proportion Expected Kit Homes Actual Kit Homes
StreetName
PARK 9700 0.002770 7.60 9.0
MAPLE 9136 0.002609 7.16 15.0
WASHINGTON 9030 0.002579 7.08 9.0
LINCOLN 8373 0.002391 6.56 9.0
CLEVELAND 6511 0.001859 5.10 16.0
BROADWAY 6103 0.001743 4.78 9.0
OAK 5713 0.001631 4.48 10.0
CHURCH 5176 0.001478 4.06 9.0
RIVER 4544 0.001298 3.56 13.0
FOREST 2986 0.000853 2.34 9.0
WARREN 2182 0.000623 1.71 12.0
CAMBRIDGE 1838 0.000525 1.44 17.0
WOOSTER 1721 0.000491 1.35 11.0
HUNTER 1103 0.000315 0.86 10.0
HILLSIDE 885 0.000253 0.69 12.0
WYOMING 855 0.000244 0.67 10.0
SEYMOUR 539 0.000154 0.42 8.0
SUTTON 472 0.000135 0.37 11.0
BEACON 415 0.000119 0.33 13.0
JOSEPH 391 0.000112 0.31 9.0
PONDVIEW 181 0.000052 0.14 9.0
CARTHAGE 145 0.000041 0.11 10.0
FOURTEENTH 137 0.000039 0.11 14.0
ELBERON 86 0.000025 0.07 9.0
KRYDER 49 0.000014 0.04 9.0
HOLLIBAUGH 48 0.000014 0.04 9.0

Distance to railroad

Sears kit homes were delivered from the factory by railroad freight car. Areas closer to railroads may be more likely to have kit homes due to the effort and expense involved in moving the materials from the railroad to the building site.

The US Census Bureau provides mapping files for railroad tracks. Using QGIS geospatial software I calculated the shortest distance between each kit home and the nearest railroad track. Then I calculated what percentage of kit homes were found within one mile, two miles, etc., of railroad tracks. I used Ohio as the test case for this.
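The actual distance measurement was done in QGIS, but the underlying idea can be sketched in plain Python: compute the great-circle ("as the crow flies") distance from each house to every sampled point along the tracks and keep the minimum. This is a brute-force haversine sketch; the coordinates below are invented for illustration, and QGIS uses proper geodesic routines rather than this simplification.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def nearest_track_distance(house, track_points):
    """Shortest distance from one house to any of a list of sampled track points."""
    return min(haversine_miles(house[0], house[1], p[0], p[1]) for p in track_points)

# Invented coordinates for illustration
house = (40.0, -83.0)
track = [(40.01, -83.0), (40.5, -83.2), (39.9, -82.8)]
print(round(nearest_track_distance(house, track), 2))  # nearest sampled point is ~0.69 miles away
```

As the comment in the next cell notes, this kind of straight-line distance ignores whether the nearest track is somewhere a freight car could actually be unloaded.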

In [37]:
# Analyze the distances that were found.  The distances are Euclidean distance, or "as the crow flies" 
# distance between a house and the nearest railroad track.  It ignores whether the nearest RR found is at a 
# spur or yard where unloading could actually take place.

# Load the CSV file containing the distance measures.
distances_df = pd.read_csv("~/Documents/CensusUS/Hub Distances.csv")

The following chart shows that in Ohio a large majority of the kit homes are within one or two miles of a railroad track. 63 percent are within 1 mile of a railroad. 79 percent are within 2 miles of a railroad.

In [38]:
plt.figure(figsize=(12, 5))
my_bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
arr = plt.hist(distances_df["HubDist"], bins = my_bins)
plt.xticks([0,1,2,3,4,5,6,7,8,9,10,11,12])

for i in range(len(my_bins) - 1):
    plt.text(x = arr[1][i] + 0.2, y = arr[0][i] + 15, s = str(int(arr[0][i])), fontsize = 12)
    # plt.xlabel('Value')

plt.ylabel('Number of Houses')
plt.xlabel('Range of Distances to Nearest Railroad - Miles')

plt.show()
In [39]:
# Create a dataframe of the cumulative percents for each bin of distance
lst = []  # list object to hold the rows of calculations
cumul = 0
for i in range(len(my_bins) - 1):
    cumul = cumul + int(arr[0][i])
    # Append the row of calculations to the main list
    lst.append([str(i) + ' to ' + str(i + 1), int(arr[0][i]), cumul, round(cumul / len(distances_df), 3) * 100])

# Create the DataFrame from the list of calculations
distance_df = pd.DataFrame(lst, columns = ['Distance to RR in Miles', 'Count', 'Cumulative Count', 'Cumulative Percent'])

Using Distance to Railroad Track for Prediction

The Census Bureau provides the longitude and latitude of the geographic center of each census tract. I used those points in a QGIS map layer to find the distance to railroad tracks. The census tracts whose centers are within 1 mile of a railroad track are shaded on the map.

The map offers some visual confirmation that Sears kit homes are predominantly in areas within a mile or two of a railroad track.

In [61]:
# Build the list of locations of homes in Ohio.
ohio_mapping_data_df = mapping_data_df[mapping_data_df["State"] == "OH"]
# Reset the index for this subset of data
ohio_mapping_data_df.reset_index(drop=True, inplace=True)

# Build a list containing all the coordinates so they can be plotted on the map
ohio_locations = ohio_mapping_data_df[['Latitude', 'Longitude']]
ohio_locationlist = ohio_locations.values.tolist()
In [64]:
# The following assumes I previously created a geojson map layer of the census tracts within 1 mile using
# QGIS so all I need to do here is load it and display it.

map_1mile = folium.Map(location=[40.367474, -82.996216], zoom_start=7, width=900, height=550, control_scale=True)

# Define a function called when the census tract layer file is added that modifies its appearance.
def censusTract_style(feature):
    return {
        "weight": 1,
        "color": "darkblue",
        "fill": True,
    }

# Add geojson file census tracts to map using the style function defined above.
folium.GeoJson('CensusTractsWithin1Mile.geojson', name='Ohio Census Tracts Within 1 mile of RR', style_function = censusTract_style).add_to(map_1mile)


# Feature groups allow customization of the layer control labels (otherwise they show auto-generated names)
ohio_fg_confirmed = folium.FeatureGroup(name = 'Confirmed Locations', show = True)
map_1mile.add_child(ohio_fg_confirmed)
ohio_fg_unconfirmed = folium.FeatureGroup(name = 'Unconfirmed Locations', show = True)
map_1mile.add_child(ohio_fg_unconfirmed)

# Add the Marker clusters for confirmed and unconfirmed locations to feature group
ohio_marker_cluster_confirmed = plugins.MarkerCluster(icon_create_function = icon_create_function_confirmed).add_to(ohio_fg_confirmed)  
ohio_marker_cluster_unconfirmed = plugins.MarkerCluster(icon_create_function=icon_create_function_unconfirmed).add_to(ohio_fg_unconfirmed) 

# A function to choose a marker color depending on if the house is a confirmed kit house or not.
# The individual location markers use the same color as their cluster markers.
def getcolor(auth_val):
    if auth_val == 'YES':
        return ("darkblue", "Confirmed")
    return ("lightblue","Unconfirmed")

### Add a layer to the map showing Confirmed kit homes
# Loop through all the location pairs.
for point in range(0, len(ohio_locationlist)):
    try:
        clr, status = getcolor(ohio_mapping_data_df["Auth"][point])
        if status == "Confirmed":
            folium.Marker(
                location = ohio_locationlist[point], 
                popup = status + " " + ohio_mapping_data_df['Model'][point] + ": " + ohio_mapping_data_df['ADDRESS_OUT'][point],            
                icon = folium.Icon(color = clr)
            ).add_to(ohio_marker_cluster_confirmed)

    except Exception:  # not all addresses could be geocoded so skip them if coordinates are missing
        pass
        
### Add a layer to the map showing Unconfirmed kit homes
for point in range(0, len(ohio_locationlist)):
    try:
        clr, status = getcolor(ohio_mapping_data_df["Auth"][point])
        if status == "Unconfirmed":
            folium.Marker(
                location = ohio_locationlist[point], 
                popup = status + " " + ohio_mapping_data_df['Model'][point] + ": " + ohio_mapping_data_df['ADDRESS_OUT'][point],            
                icon = folium.Icon(color = clr)
            ).add_to(ohio_marker_cluster_unconfirmed)

    except Exception:  # not all addresses could be geocoded so skip them if coordinates are missing
        pass
        
# add layer control to map (allows layer to be turned on or off)
folium.LayerControl().add_to(map_1mile)

map_1mile
Out[64]:

Socioeconomic characteristics

Use US Census Bureau data to analyze social, economic and demographic characteristics of locations of Sears kit homes.

For each Sears kit home identify the State, County and Census Tract of its location.

For each State, County and Census Tract, find the median household income, median household size, types of housing as percent, percent of population with college education, average age, percent of population that is non-white, average time spent commuting to work.

  • For each Census Tract, is median household income greater or less than that of County? State?
  • For each Census Tract, is median household size greater or less than that of County? State?
  • For each Census Tract, is percent of different types of housing greater or less than that of County? State?
  • For each Census Tract, is percent with high school graduation greater or less than that of County? State?
  • For each Census Tract, is average age greater or less than that of County? State?
  • For each Census Tract, is percent of population that is non-white greater or less than that of County? State?
  • For each Census Tract, is average time spent commuting greater or less than that of County? State?

The Census Bureau identifier (FIPS code) for Ohio is "39".

The data source will be the American Community Survey 5 Year Estimate for 2019.

Percent High School or Higher: Table "S1501", variable "S1501_C02_014E"
Median Household Income: Table "DP03", variable "DP03_0062E"

In [ ]:
# URL to get county names and FIPS codes for each state
# https://api.census.gov/data/2010/dec/sf1?get=NAME&for=county:* 


# Components of the base URL go into variables.
HOST = "https://api.census.gov/data"
year = "2010"
dataset = "dec/sf1"

# Code to join the base url components
base_url = "/".join([HOST, year, dataset])

# Form the predicates dictionary
predicates = {}  # initialize empty dictionary
get_vars = ["NAME"]   # Want the name of the State and the total population.
predicates["get"] = ",".join(get_vars)
predicates["for"] = "county:*"   # States are the geography to return data for. "*" means all states.

# Make the request.  Results get loaded into a response variable as a list of lists
list_of_lists = requests.get(base_url, params = predicates).json()  # need .json() to force a json format
In [ ]:
# Display some results after list of lists is converted to a dataframe.
list_of_lists.pop(0) 
counties_df = pd.DataFrame(list_of_lists, columns = ["County State", "State FIPS", "County FIPS"])
counties_df.head()
In [ ]:
# Get just the Ohio counties
ohio_counties_df = counties_df[counties_df["State FIPS"] == "39"].sort_values(by = ["County FIPS"])

Define a Function to Retrieve Census Data

I define a function that takes as inputs a single table code, a variable code belonging to that table representing the census statistic I want, a year, a geographic level and a state identifier. The table code must start with one of these prefix letters: B, S, D (data profile tables begin with DP) or P. The function returns data at the requested geographic level for the single state specified.

The function only works for base tables, subject tables, data profile tables and decennial SF1 tables.

The only way to determine the correct variable code for each year is to manually search that year's table structure using the groups Web page: https://api.census.gov/data/2018/acs/acs5/groups.html. Replace the 2018 in this URL with the year you want, find the table, and look at its variable names. Do that for each year you are extracting data for.
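That manual search can also be sketched programmatically: the API publishes the same table structure as JSON (e.g. a variables.json alongside the groups.html page linked above). The helper below filters such a structure for estimate codes (names ending in "E") whose label contains a phrase. The endpoint URL and the exact dict layout shown are assumptions, not verified against every dataset:

```python
def find_variable_codes(variables_dict, label_text):
    """Return estimate variable codes (names ending in 'E') whose label contains label_text."""
    return sorted(
        name
        for name, meta in variables_dict.get("variables", {}).items()
        if name.endswith("E") and label_text.lower() in meta.get("label", "").lower()
    )

# Hypothetical usage against the live endpoint (requires the requests library):
# resp = requests.get("https://api.census.gov/data/2019/acs/acs5/subject/variables.json")
# print(find_variable_codes(resp.json(), "high school graduate or higher"))
```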

In [ ]:
# Define function to retrieve US Census data for a specified year
def fn_state_tract_data(tablecode, variable_code_list, census_year, geolevel, state_fips):

    # INPUTS:
    # tablecode: a single US Census table code that starts with B, S, D or P
    #
    # variable_code_list: a list of one or more US Census Bureau data variable codes, Example "[DP05_0060PE]"
    #
    # census_year: a 4 digit year
    #
    # geolevel: the geography type. Should be one of these: tract, county, state
    #           Specify "tract" for census tract level results. Data for all tracts will be returned
    #           Specify "county" for county level results. Data for all counties will be returned
    #           Specify "state" for state level results. Data for the single state specified will be returned
    #
    # state_fips: the FIPS code for the state you want data for
    #
    # Output:
    # A single dataframe containing data for all the variables for the specified year and geography


    # Components of the base URL go into variables.
    HOST = "https://api.census.gov/data"
    year = census_year
    
    # Make sure request is for a Base, Subject or Data Profile table as indicated by the leading letters
    # in the table code.
    s = tablecode[0]    
    if s not in ("S", "D", "B", "P"):
        print("Error: table did not start with B, S, D or P")
        return
    
    # The leading letter determines the base of the URL
    if s == "S":
        dataset = "acs/acs5/subject"  
    elif s == "D":                              
        dataset = "acs/acs5/profile"  
    elif s == "B":                             
        dataset = "acs/acs5/"
    elif s == "P":                             
        dataset = "dec/sf1/"        

    # Code to join the base url components
    base_url = "/".join([HOST, year, dataset])
    
    # Form the predicates dictionary depending on the geolevel requested
    predicates = {}  # initialize empty dictionary
    # get_vars = ["NAME", variable_code]   # 1st column will be "NAME" of the geography, then the Census data.
    get_vars = ["NAME"]       # 1st column will be "NAME" of the geography, then the Census data.
    for var in variable_code_list:
        get_vars.append(var)    # Add the others varaibles 
    predicates["get"] = ",".join(get_vars)
    if geolevel == "tract" or geolevel == "county":
        predicates["for"] = geolevel + ":*"   # the desired geography. "*" means all.
        predicates["in"] = "state:" + state_fips   # the State to return data for.
    if geolevel == "state":
        predicates["for"] = geolevel + ":" + state_fips  # get a single State value. No "in" predicate needed.
        
    # Make the request.  Results get loaded into a response object, "myresponse".
    myresponse = requests.get(base_url, params = predicates)
    
    # Make a data frame out of the response.  The first row is the column names.
    df = pd.DataFrame(columns = myresponse.json()[0], data = myresponse.json()[1:])
    
    return df

Notes about figuring out Census table codes

Here is the process I used. Go to https://data.census.gov/cedsci/advanced (the American Community Survey advanced search page) and explore until you find the data you want. Once the data is displayed, make sure it comes from a 5-year estimate table for the year 2019. Make a note of the table code and the exact wording of the label used in the data display.

To get the variable code for the previously noted label, copy and paste one of these URLs into a browser.

If the leading letter of the table code is "S" https://api.census.gov/data/2019/acs/acs5/subject/groups.html

If the leading letter of the table code is "D" https://api.census.gov/data/2019/acs/acs5/profile/groups.html

If the leading letter of the table code is "B" https://api.census.gov/data/2019/acs/acs5/groups.html

If the leading letter of the table code is "P" https://api.census.gov/data/2010/dec/sf1/groups.html (the year must be a decennial year)

Scroll down to the table code in the "Name" column.

Click on the "Selected Variables" link.

Search the "Label" column for the text of the previously noted data label. Look for a code in the "Name" column that ends with "E" (E stands for "Estimate", to distinguish it from margins of error, annotations and other values)

Number of single family residences per census tract

This number will be used as the denominator when comparing the statistics of kit homes with the state as a whole. For example, when comparing the median year built of kit homes to the median year built of all homes, the median year built for all homes will be weighted by the number of residences in each census tract.
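The weighting described above can be sketched with `numpy.average`. The tract values and residence counts below are hypothetical, and a weighted mean stands in for the weighted median to keep the example short:

```python
import numpy as np
import pandas as pd

# Hypothetical per-tract data: median year built and number of residences
tracts = pd.DataFrame({
    "median_year_built": [1935, 1950, 1972],
    "residences": [400, 1000, 600],
})

# Weight each tract's value by its residence count, so larger tracts count more
weighted_year = np.average(tracts["median_year_built"], weights=tracts["residences"])
```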

TODO

Percent adults age 25 yrs or older with high school graduation or higher attainment

ACS 5-year table S1501. Variable S1501_C02_014E Estimate!!Percent!!Population 25 years and over!!Percent high school graduate or higher.

The results are the percent with high school graduation or higher.

The analysis below instead uses variable S1501_C02_015E, Estimate!!Percent!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Bachelor's degree or higher.

In [ ]:
# Get the education attainment data at the State level in Ohio
dfs = []  # Initialize an empty list that will contain the data frames for all the years

### Get the census data and put it into dataframe ###

table = "S1501"
variable = ["S1501_C02_015E"]
years = "2019"  # The most recent year data available
geolevel = "state"  
state = "39" # Ohio

state_education_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)
In [ ]:
state_education_df
In [ ]:
# Get the education attainment data at the County level in Ohio
dfs = []  # Initialize an empty list that will contain the data frames for all the years

### Get the census data and put it into dataframe ###

table = "S1501"
variable = ["S1501_C02_015E"]
years = "2019"  # The most recent year data available
geolevel = "county"  
state = "39" # Ohio

county_education_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)
In [ ]:
# Examine the result
county_education_df.head()
In [ ]:
# Get the education attainment data at the Census Tract level in Ohio for all Census Tracts
dfs = []  # Initialize an empty list that will contain the data frames for all the years

### Get the census data and put it into dataframe ###

table = "S1501"
variable = ["S1501_C02_015E"]
years = "2019"  # The most recent year data available
geolevel = "tract"  
state = "39" # Ohio

# Get the education level statistic for each census tract in Ohio.
tract_education_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Add unique identifier for Census Tract that combines state, county and tract numbers.
tract_education_df["GeoID"] = tract_education_df["state"] \
                                + tract_education_df["county"] \
                                + tract_education_df["tract"]

# Make the statistic a number instead of a string
tract_education_df["S1501_C02_015E"] = tract_education_df["S1501_C02_015E"].astype(float)

# Give the statistic code a descriptive name
tract_education_df.rename(columns={'S1501_C02_015E': 'Bachelor Degree or Higher Pct'}, inplace=True)
In [ ]:
# Examine the results
print("There are", len(tract_education_df), "Census Tracts in Ohio")
tract_education_df.head()

Explore statistics of census tracts

See if things like income, age, education etc, are significantly different between census tracts that have kit homes, and ones that do not.

In [ ]:
# Examine the list of kit homes in Ohio and where they are located
kitpoints_df.head()
In [ ]:
# Add the education statistics to the kit home census tract locations dataframe
# Merge the Sears Kit Home style for each location from the original data with the geocoded data.
kitpoints_stats_df = pd.merge(left = kitpoints_df, 
                           right = tract_education_df[["Bachelor Degree or Higher Pct", "GeoID"]], 
                           how = 'left', 
                           left_on = 'GeoID',
                           right_on = 'GeoID')

Compare the distribution of bachelor's degree or higher percentages in Ohio census tracts that have kit homes with census tracts that do not have kit homes.

In [ ]:
# Build list of census tracts that have kit homes
tracts_with_kits_df = kitpoints_stats_df.loc[kitpoints_stats_df["Bachelor Degree or Higher Pct"] > 0, ["GeoID", "Bachelor Degree or Higher Pct"]].drop_duplicates()
In [ ]:
print("There are", len(tracts_with_kits_df), "Census Tracts that have kit homes")
In [ ]:
# First build a list of ALL census tracts in Ohio
all_census_tracts_df = tract_education_df.loc[tract_education_df["Bachelor Degree or Higher Pct"]>0,["GeoID", "Bachelor Degree or Higher Pct"]]

# Get the list of tracts from the all tracts list that are not in the with kit home list
tracts_without_kits_df = all_census_tracts_df[~all_census_tracts_df["GeoID"].isin(tracts_with_kits_df["GeoID"])]
In [ ]:
# Plot the distribution of bachelor's degree attainment for tracts with kit homes
# (sns.distplot is deprecated; kdeplot with fill=True gives the same shaded KDE curve)
sns.kdeplot(tracts_with_kits_df["Bachelor Degree or Higher Pct"], fill=True, linewidth=3)

# Plot the same distribution for tracts without kit homes
sns.kdeplot(tracts_without_kits_df["Bachelor Degree or Higher Pct"], fill=True, linewidth=3)

Percent of Population with Bachelor Degree or Higher in State of Ohio

In [ ]:
# the State level statistic
state_education_df
In [ ]:
# The median value across all census tracts
tract_education_df["Bachelor Degree or Higher Pct"].median()

Median Percent of Population with Bachelor Degree or Higher for Census Tracts Without Kit Homes

In [ ]:
tracts_without_kits_df["Bachelor Degree or Higher Pct"].median()

Based on the above, educational attainment does not look like a good indicator of census tracts likely to have kit homes.

Household Income

Perform an analysis of household income of census tracts that have kit homes.

Use ACS 5-year table DP03. DP03_0062E Estimate!!INCOME AND BENEFITS (IN 2019 INFLATION-ADJUSTED DOLLARS)!!Total households!!Median household income (dollars)

In [ ]:
# Get the median household income for the state of Ohio as a whole.

### Get the census data and put it into dataframe ###

table = "DP03"
variable = ["DP03_0062E"]
years = "2019"  # The most recent year data available
geolevel = "state"  
state = "39" # Ohio

state_income_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Make the statistic a number instead of a string
state_income_df["DP03_0062E"] = state_income_df["DP03_0062E"].astype(float)

# Give the statistic code a descriptive name
state_income_df.rename(columns={'DP03_0062E': 'Median Household Income'}, inplace=True)
In [ ]:
state_income_df

state_median_income = "${:,.2f}".format(state_income_df['Median Household Income'].median())

print("The median household income in Ohio is", state_median_income)
In [ ]:
# Get the median household income for the census tracts in Ohio.

### Get the census data and put it into dataframe ###

table = "DP03"
variable = ["DP03_0062E"]
years = "2019"  # The most recent year data available
geolevel = "tract"  
state = "39" # Ohio

# Get the median household income statistic for each census tract in Ohio.
tract_income_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Add unique identifier for Census Tract that combines state, county and tract numbers.
tract_income_df["GeoID"] = tract_income_df["state"] \
                                + tract_income_df["county"] \
                                + tract_income_df["tract"]

# Make the statistic a number instead of a string
tract_income_df["DP03_0062E"] = tract_income_df["DP03_0062E"].astype(float)

# Give the statistic code a descriptive name
tract_income_df.rename(columns={'DP03_0062E': 'Median Household Income'}, inplace=True)
In [ ]:
# Add the income statistic to the kit home dataframe
kitpoints_stats_df = pd.merge(left = kitpoints_stats_df, 
                           right = tract_income_df[["Median Household Income", "GeoID"]], 
                           how = 'left', 
                           left_on = 'GeoID',
                           right_on = 'GeoID')
In [ ]:
# Rank the kit homes by the income of their census tracts and take the middle value to get the 
# median household income of kit home locations.

kit_median_income = "${:,.2f}".format(kitpoints_stats_df['Median Household Income'].median())
print("The median household income of kit home locations in Ohio is", kit_median_income)

The median household income of kit home locations is about $9,000 less than the median income for Ohio overall. This is about a 16% difference, so this is a variable that could be useful in the predictive model.

Neighborhoods built 1908 to 1940

How can neighborhoods being developed during this period be identified? Possibilities might be to find public structures like schools and fire houses built during that time.

The US Census Bureau asks when a residence was built in its annual American Community Survey. The median age of the structures in a census tract could be used as an indicator of how old a developed area is.

Year Structure Built

One of the questions in the annual American Community Survey relates to the age of residential structures. The possible responses are broken down into the following categories:

2014 or later
2010 to 2013
2000 to 2009
1980 to 1999
1960 to 1979
1940 to 1959
1939 or earlier

Since Sears only sold kit homes between 1908 and 1942, I used the 1939-or-earlier group as the closest available match.

ACS table S2504. S2504_C02_015E, Estimate!!Percent occupied housing units!!Occupied housing units!!YEAR STRUCTURE BUILT!!1939 or earlier

In [ ]:
# Get the percent of structures built 1939 or earlier for all Ohio.

### Get the census data and put it into dataframe ###

table = "S2504"
variable = ["S2504_C02_015E"]
years = "2019"  # The most recent year data available
geolevel = "state"  
state = "39" # Ohio

# Get the percent built 1939 or earlier statistic for the state of Ohio.
state_yearBuilt_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Add a GeoID column (at the state level this is just the state FIPS code).
state_yearBuilt_df["GeoID"] = state_yearBuilt_df["state"]

# Make the statistic a number instead of a string
state_yearBuilt_df["S2504_C02_015E"] = state_yearBuilt_df["S2504_C02_015E"].astype(float)

# Give the statistic code a descriptive name
state_yearBuilt_df.rename(columns={'S2504_C02_015E': 'Percent Built 1939 and Earlier'}, inplace=True)
In [ ]:
# Get the percent structure built 1939 and earlier for the census tracts in Ohio.

### Get the census data and put it into dataframe ###

table = "S2504"
variable = ["S2504_C02_015E"]
years = "2019"  # The most recent year data available
geolevel = "tract"  
state = "39" # Ohio

# Get the percent built 1939 or earlier statistic for each census tract in Ohio.
tract_yearBuilt_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Add unique identifier for Census Tract that combines state, county and tract numbers.
tract_yearBuilt_df["GeoID"] = tract_yearBuilt_df["state"] \
                                + tract_yearBuilt_df["county"] \
                                + tract_yearBuilt_df["tract"]

# Make the statistic a number instead of a string
tract_yearBuilt_df["S2504_C02_015E"] = tract_yearBuilt_df["S2504_C02_015E"].astype(float)

# Give the statistic code a descriptive name
tract_yearBuilt_df.rename(columns={'S2504_C02_015E': 'Percent Built 1939 and Earlier'}, inplace=True)
In [ ]:
tract_yearBuilt_df.head()
In [ ]:
# Save this result to a CSV file.  It will be used later to build a cloropleth map.
tract_yearBuilt_df.to_csv("tract_yearBuilt_df.csv")
In [ ]:
# Add the year built statistic to the kit homes dataset.
kitpoints_stats_df = pd.merge(left = kitpoints_stats_df, 
                           right = tract_yearBuilt_df[["Percent Built 1939 and Earlier", "GeoID"]], 
                           how = 'left', 
                           left_on = 'GeoID',
                           right_on = 'GeoID')
In [ ]:
# Examine the results
kitpoints_stats_df.head()
In [ ]:
# Rank the kit homes by percent built 1939 or earlier of their census tracts and take the middle value to 
# get the median percent built 1939 or earlier of kit home locations.

kit_median_yearBuilt = "{:,.2f}".format(kitpoints_stats_df['Percent Built 1939 and Earlier'].median())
print("The median 'Percent Built 1939 and Earlier' of kit home census tracts in Ohio is", kit_median_yearBuilt)
In [ ]:
s = tract_yearBuilt_df.loc[tract_yearBuilt_df["Percent Built 1939 and Earlier"] > 0, ["Percent Built 1939 and Earlier"]]
In [ ]:
# Plot the distribution of Percent Built 1939 and Earlier for kit home census tracts
# (sns.distplot is deprecated; kdeplot with fill=True gives the same shaded KDE curve)
sns.kdeplot(kitpoints_stats_df["Percent Built 1939 and Earlier"], fill=True, linewidth=3)

# Plot the distribution of Percent Built 1939 and Earlier for all Ohio census tracts
sns.kdeplot(s["Percent Built 1939 and Earlier"], fill=True, linewidth=3)

Census tract data on when structures were built may be a useful predictor. Census tracts containing kit homes have a higher percentage of structures built in 1939 or earlier than census tracts in general in Ohio.

Percent Renters

There is a bug in the Census data: variable S2502_C06_001E is a count, not a percent, as its label would indicate. That count must be divided by the total number of occupied housing units to get the percent of housing that is rented.

ACS 5-Year table: S2502. Variable: S2502_C06_001E, Estimate!!Percent renter-occupied housing units!!Occupied housing units divided by variable S2502_C01_001E, Estimate!!Occupied housing units!!Occupied housing units.

In [ ]:
### Get the Percent Renter Occupied Housing for state as a whole ###

table = "S2502"
variable = ["S2502_C01_001E", "S2502_C06_001E"]
years = "2019"  # The most recent year data available
geolevel = "state"  
state = "39" # Ohio

# Get the renter-occupied housing statistics for the state of Ohio.
state_renters_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Add a GeoID column (at the state level this is just the state FIPS code).
state_renters_df["GeoID"] = state_renters_df["state"]

# Make the statistics a number instead of a string
state_renters_df["S2502_C01_001E"] = state_renters_df["S2502_C01_001E"].astype(float)
state_renters_df["S2502_C06_001E"] = state_renters_df["S2502_C06_001E"].astype(float)

# Give the statistic codes a descriptive name
state_renters_df.rename(columns={'S2502_C01_001E': 'Total Occupied Housing'}, inplace=True)
state_renters_df.rename(columns={'S2502_C06_001E': 'Renter Occupied Housing'}, inplace=True)
In [ ]:
state_renters_df["Percent Renter Occupied Housing"] = state_renters_df['Renter Occupied Housing']/state_renters_df['Total Occupied Housing']
In [ ]:
state_renters_df
In [ ]:
# Get the percent rented housing for the census tracts in Ohio.

### Get the census data and put it into dataframe ###

table = "S2502"
variable = ["S2502_C01_001E", "S2502_C06_001E"]
years = "2019"  # The most recent year data available
geolevel = "tract"  
state = "39" # Ohio

# Get the renter-occupied housing statistics for each census tract in Ohio.
tract_renters_df = fn_state_tract_data(tablecode = table, 
                    variable_code_list = variable, 
                    census_year = years, 
                    geolevel = geolevel,
                    state_fips = state)

# Add unique identifier for Census Tract that combines state, county and tract numbers.
tract_renters_df["GeoID"] = tract_renters_df["state"] \
                                + tract_renters_df["county"] \
                                + tract_renters_df["tract"]

# Make the statistic a number instead of a string
tract_renters_df["S2502_C01_001E"] = tract_renters_df["S2502_C01_001E"].astype(float)
tract_renters_df["S2502_C06_001E"] = tract_renters_df["S2502_C06_001E"].astype(float)

# Give the statistic code a descriptive name
tract_renters_df.rename(columns={'S2502_C01_001E': 'Total Occupied Housing'}, inplace=True)
tract_renters_df.rename(columns={'S2502_C06_001E': 'Renter Occupied Housing'}, inplace=True)
In [ ]:
tract_renters_df["Percent Renter Occupied Housing"] = tract_renters_df['Renter Occupied Housing']/tract_renters_df['Total Occupied Housing']
In [ ]:
tract_renters_df.head()
In [ ]:
# Add the percent rented statistic to the kit homes dataset.
kitpoints_stats_df = pd.merge(left = kitpoints_stats_df, 
                           right = tract_renters_df[["Percent Renter Occupied Housing", "GeoID"]], 
                           how = 'left', 
                           left_on = 'GeoID',
                           right_on = 'GeoID')
In [ ]:
kitpoints_stats_df.head()
In [ ]:
# Rank the kit homes by percent rented of their census tracts and take the middle value to 
# get the median percent rented for kit home locations.

kit_median_rented = "{:,.2f}".format(kitpoints_stats_df['Percent Renter Occupied Housing'].median())
print("The median 'Percent Renter Occupied Housing' of kit home census tracts in Ohio is", kit_median_rented)

Mapping Kit Home Predictors

Distance to Railroads

In [ ]:
# Create a map using the Map() function and the coordinates of the locations of all the homes.

# Map starts out centered on Ohio.
map_rr = folium.Map(location=[40.367474, -82.996216], zoom_start=7, width=900, height=550, control_scale=True)
# Ohio_map

### Define functions to set the color of cluster markers. Confirmed and unconfirmed locations have 
### different colors.
# This sets the color for CONFIRMED locations clusters.
icon_create_function_confirmed = """
    function(cluster) {
    var childCount = cluster.getChildCount(); 
    /* 
    // comment: can have something like the following to modify the different cluster sizes....
    var c = ' marker-cluster-';

    if (childCount < 50) {
        c += 'large';
    } else if (childCount < 300) {
        c += 'medium';
    } else {
        c += 'small';
    }    
    
    // The marker-cluster-<'size'> gets passed in the "return new L.DivIcon()" function below.
    */
    
    return new L.DivIcon({ html: '<div><span style="background-color:darkblue;color:white;font-size: 20px;">' + childCount + '</span></div>', className: 'marker-cluster', iconSize: new L.Point(40, 30) });    
    }
    """

# This sets the color of UNCONFIRMED location clusters.
icon_create_function_unconfirmed = """
    function(cluster) {
    var childCount = cluster.getChildCount(); 

    return new L.DivIcon({ html: '<div><span style="background-color:lightblue;color:black;font-size: 20px;">' + childCount + '</span></div>', className: 'marker-cluster', iconSize: new L.Point(40, 30) });    
    }
    """

# Feature groups allow customization of the layer control labels (otherwise they show auto-generated names)
fg_confirmed = folium.FeatureGroup(name = 'Confirmed Locations', show = True)
map_rr.add_child(fg_confirmed)
fg_unconfirmed = folium.FeatureGroup(name = 'Unconfirmed Locations', show = True)
map_rr.add_child(fg_unconfirmed)

# Add the Marker clusters for confirmed and unconfirmed locations to feature group
marker_cluster_confirmed = plugins.MarkerCluster(icon_create_function = icon_create_function_confirmed).add_to(fg_confirmed)  
marker_cluster_unconfirmed = plugins.MarkerCluster(icon_create_function=icon_create_function_unconfirmed).add_to(fg_unconfirmed) 

# A function to choose a marker color depending on if the house is a confirmed kit house or not.
# The individual location markers use the same color as their cluster markers.
def getcolor(auth_val):
    if auth_val == 'YES':
        return ("darkblue", "Confirmed")
    return ("lightblue","Unconfirmed")

### Add a layer to the map showing Confirmed kit homes
# Loop through all the location pairs.
for point in range(0, len(locationlist)):
    try:
        clr, status = getcolor(mapping_data_df["Auth"][point])
        if status == "Confirmed":
            folium.Marker(
                location = locationlist[point], 
                popup = status + " " + mapping_data_df['Model'][point] + ": " + mapping_data_df['ADDRESS_OUT'][point],            
                icon = folium.Icon(color = clr)
            ).add_to(marker_cluster_confirmed)

    except Exception:  # not all addresses could be geocoded so skip them if coordinates are missing
        pass
        
### Add a layer to the map showing Unconfirmed kit homes
for point in range(0, len(locationlist)):
    try:
        clr, status = getcolor(mapping_data_df["Auth"][point])
        if status == "Unconfirmed":
            folium.Marker(
                location = locationlist[point], 
                popup = status + " " + mapping_data_df['Model'][point] + ": " + mapping_data_df['ADDRESS_OUT'][point],            
                icon = folium.Icon(color = clr)
            ).add_to(marker_cluster_unconfirmed)

    except Exception:  # not all addresses could be geocoded so skip them if coordinates are missing
        pass

# Define a function called when the census tract layer file is added that modifies its appearance.
def censusTract_style(feature):
    return {
        "weight": 1,
        "color": "darkblue",
        "fill": True,
    }

# Add geojson file census tracts to map using the style function defined above.
folium.GeoJson('CensusTractsWithin1Mile.geojson', name='Ohio Census Tracts', style_function = censusTract_style).add_to(map_rr)


# add layer control to map (allows layer to be turned on or off)
folium.LayerControl().add_to(map_rr)

# Display the map
map_rr
In [ ]: